[Feature] Add cleanup for terminated RayJob/RayCluster metrics #3923


Open
phantom5125 wants to merge 9 commits into master

Conversation


@phantom5125 phantom5125 commented Aug 7, 2025

Why are these changes needed?

Some of our metrics are stored permanently, which can cause the /metrics endpoint to become slow or time out over time, so we need a lifecycle-based cleanup.

Related issue number

Closes #3820

End-to-end test example

$ kubectl apply -f ray-operator/config/samples/ray-job.sample.yaml

# In a separate terminal:
$ kubectl port-forward <kuberay-operator-pod-name> 8080:8080

$ curl -s 127.0.0.1:8080/metrics | grep kuberay_
# HELP kuberay_cluster_condition_provisioned Indicates whether the RayCluster is provisioned
# TYPE kuberay_cluster_condition_provisioned gauge
kuberay_cluster_condition_provisioned{condition="true",name="rayjob-sample-clwvk",namespace="default"} 1
# HELP kuberay_cluster_info Metadata information about RayCluster custom resources
# TYPE kuberay_cluster_info gauge
kuberay_cluster_info{name="rayjob-sample-clwvk",namespace="default",owner_kind="RayJob"} 1
# HELP kuberay_cluster_provisioned_duration_seconds The time, in seconds, when a RayCluster's `RayClusterProvisioned` status transitions from false (or unset) to true
# TYPE kuberay_cluster_provisioned_duration_seconds gauge
kuberay_cluster_provisioned_duration_seconds{name="rayjob-sample-clwvk",namespace="default"} 1259.406597953
...

After the CR is deleted, its metrics are gone as well:

$ kubectl delete rayjob rayjob-sample
$ curl -s 127.0.0.1:8080/metrics | grep kuberay_

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@phantom5125 phantom5125 marked this pull request as draft August 7, 2025 19:02
@troychiu
Contributor

troychiu commented Aug 7, 2025

Hi @phantom5125, thank you for creating the PR. Just want to make sure we are on the same page before you start polishing it. Do we really need the TTL-based cleanup? I was thinking of cleaning up the metrics as soon as the CR is deleted.

@phantom5125
Author

> Hi @phantom5125, thank you for creating the PR. Just want to make sure we are on the same page before you start polishing it. Do we really need the TTL-based cleanup? I was thinking of cleaning up the metrics as soon as the CR is deleted.

Thanks for pointing this out!

From my perspective, the independent metricsTTL is primarily intended to address the scenario where JobTTLSeconds is set to 0. In that case, the RayJob CR is deleted immediately after the job finishes, so metrics like kuberay_job_execution_duration_seconds may never be collected, because they would likely be deleted as soon as they are produced.

@troychiu
Contributor

troychiu commented Aug 9, 2025

> > Hi @phantom5125, thank you for creating the PR. Just want to make sure we are on the same page before you start polishing it. Do we really need the TTL-based cleanup? I was thinking of cleaning up the metrics as soon as the CR is deleted.
>
> Thanks for pointing this out!
>
> From my perspective, the independent metricsTTL is primarily intended to address the scenario where JobTTLSeconds is set to 0. In that case, the RayJob CR is deleted immediately after the job finishes, so metrics like kuberay_job_execution_duration_seconds may never be collected, because they would likely be deleted as soon as they are produced.

I think introducing TTL-based cleanup is overkill for this scenario. Instead, we can simply document that setting JobTTLSeconds to a value smaller than the Prometheus scrape interval may cause metrics to be deleted before Prometheus can collect them. I'd rather start with a simpler implementation. What do you think?

@phantom5125
Author

> I think introducing TTL-based cleanup is overkill for this scenario. Instead, we can simply document that setting JobTTLSeconds to a value smaller than the Prometheus scrape interval may cause metrics to be deleted before Prometheus can collect them. I'd rather start with a simpler implementation. What do you think?

Ok, I will take your suggestion and update the PR soon!

@phantom5125 phantom5125 changed the title [Feature] Add TTL-based cleanup for terminated RayJob/RayCluster metrics [Feature] Add cleanup for terminated RayJob/RayCluster metrics Aug 9, 2025
@phantom5125 phantom5125 marked this pull request as ready for review August 9, 2025 19:56
@phantom5125
Author

@troychiu PTAL, thanks!

Contributor

@troychiu troychiu left a comment


thank you for the contribution!

@phantom5125 phantom5125 requested a review from troychiu August 13, 2025 16:17

// CreateAndExecuteMetricsRequest is a test helper that creates an HTTP GET request to the /metrics endpoint,
// executes it against a Prometheus handler using the provided registry, and returns the request, response recorder, and handler.
func CreateAndExecuteMetricsRequest(t *testing.T, reg *prometheus.Registry) (*http.Request, *httptest.ResponseRecorder, http.Handler) {
Contributor


It's a bit hard for me to understand the usage of this helper function. It not only calls ServeHTTP to send the request but also returns the handler so that it can be used again. For a test case that sends multiple requests, I think it's a bit confusing.

Author


#3923 (comment)
@win5923 What do you think? I can accept either way

Contributor


I think having a helper function makes sense, but its current functionality is a bit confusing. It would be clearer if it either only returned a handler that the caller can reuse, or just sent the request on the caller's behalf. Let me know if you feel confused!

Contributor

@win5923 win5923 Aug 14, 2025


I agree with Troy's point. We can simply send the request and return the response recorder.

WDYT?

// ExecuteMetricsRequest executes a GET request to /metrics and returns the response recorder.
func ExecuteMetricsRequest(t *testing.T, handler http.Handler) *httptest.ResponseRecorder {
    t.Helper()
    req, err := http.NewRequestWithContext(context.Background(), http.MethodGet, "/metrics", nil)
    require.NoError(t, err)

    rr := httptest.NewRecorder()
    handler.ServeHTTP(rr, req)

    return rr
}

Author


@win5923 I just refactored in the latest commit with:

func GetMetricsResponseAndCode(t *testing.T, reg *prometheus.Registry) (string, int) {
	t.Helper()
	req, err := http.NewRequestWithContext(t.Context(), http.MethodGet, "/metrics", nil)
	require.NoError(t, err)

	rr := httptest.NewRecorder()
	handler := promhttp.HandlerFor(reg, promhttp.HandlerOpts{})
	handler.ServeHTTP(rr, req)

	return rr.Body.String(), rr.Code
}

Does it look OK? I found it unnecessary to reuse the handler, since in the test code we only care about the response body and status code. cc @troychiu

Contributor

@win5923 win5923 Aug 14, 2025


Sure, I think this is better.
Let's just wait for Troy's comment. Thanks!

@phantom5125 phantom5125 requested a review from troychiu August 14, 2025 17:20
Successfully merging this pull request may close these issues.

[Feature] Add prometheus metrics reset support